NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Model Lakes. In EDBT 2025.

https://doi.org/10.48786/edbt.2025.81

Pal, Koyena; Bau, David; Miller, Renée J (January 2025, OpenProceedings.org)
EDBT (Ed.)
Given a set of deep learning models, it can be hard to find models appropriate to a task, understand the models, and characterize how models are different one from another. Currently, practi- tioners rely on manually-written documentation to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding mod- els become increasingly crucial. Inspired from research on data lakes, we introduce the concept of model lakes. We formalize key model lake tasks, including model attribution, versioning, search, and benchmarking, and discuss fundamental research challenges in the management of large models. We also explore what data management techniques can be brought to bear on the study of large model management.
more » « less
ALT-GEN: Benchmarking Table Union Search using Large Language Models

Pal, Koyena; Khatiwada, Ammod; Shraga, Roee; Miller, Renée J (August 2024, PVLDB Workshop on Tabular Data Analysis (TaDA))

We consider the table union search problem which has emerged as an important data discovery problem in data lakes. Semantic problems like table union search cannot be benchmarked using only synthetic data. Our current methods for creating benchmarks for this problem involve the manual curation and human label- ing of real data. These methods are not robust or scalable and perhaps more importantly, it is not clear how comprehensive the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create ta- bles with specied properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable, but related. We use this benchmark to provide new insights into the strengths and weaknesses of existing methods. We evaluate state-of-the-art table union search methods over both existing benchmarks and our new benchmarks. We also present and evaluate a new table search method based on large language models over all benchmarks. We show that the new benchmarks are more challenging for all methods than hand-curated benchmarks. We examine why this is the case and show that our new methodology for creating benchmarks permits more detailed analysis and com- parison of methods. We discuss how our generation method (and benchmarks created using it) sheds more light into the successes and failures of table union search methods sparking new insights that can help advance the eld. We also discuss how our benchmark generation methodology can be applied to other semantic problems including entity matching and related table search.
more » « less
Full Text Available
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State

https://doi.org/10.18653/v1/2023.conll-1.37

Pal, Koyena; Sun, Jiuding; Yuan, Andrew; Wallace, Byron; Bau, David (January 2023, Proceedings of the Conference on Computational Natural Language Learning (CoNLL))

Search for: All records